Chapter 5 Step 1: soil data preparation

5.1 Format requirements of soil data

Soil data consist of measurements at a specific geographical location, time and soil depth. Therefore, it is necessary to arrange the data in the format shown in Table 5.1.

Table 5.1: Example format of a database.

Profile ID  Horizon ID  Lat       Long      Year  Top  Bottom  cec  ph   clay  silt  sand  soc  bd
1           1_1         12.12346  1.123456  2018  0    20      15   6.5  35    58    7     3.4  1.31
1           1_2         12.12346  1.123456  2018  20   40      19   7.1  42    48    10    2.1  1.32
2           2_1         23.12346  2.123456  2019  0    30      14   5.5  12    53    35    2.9  1.39
Note:
Profile ID = unique profile identifier, Horizon ID = unique layer identifier, Lat = latitude in decimal degrees, Long = longitude in decimal degrees, Year = sampling year, Top = upper limit of the layer in cm, Bottom = lower limit of the layer in cm, cec = cation exchange capacity (cmol_c/kg), ph = pH in water, clay = clay (g/100 g soil), silt = silt (g/100 g soil), sand = sand (g/100 g soil), soc = soil organic carbon (g/100 g soil), bd = bulk density (g/cm3).

In this chapter, step-by-step instructions are given on how to carry out the soil data preparation in RStudio. Instructions are given on how to:

  1. Generate user-defined variables;
  2. Set the working directory and load the necessary packages;
  3. Import national data into RStudio;
  4. Select useful columns;
  5. Perform a quality check of the data;
  6. Estimate bulk density using pedotransfer functions (PTFs);
  7. Harmonise soil layers (using splines); and
  8. Plot and save the formatted soil data.

Thus, the instructions also serve as a very basic introduction to the functioning of R and RStudio. For further information, a vast number of websites offer help and tutorials on R and RStudio.

5.1.1 Pre-processing of soil data

Soil data usually require a pre-processing step to solve common issues such as arranging the data format, fixing inconsistent soil horizon depths, and detecting unusual soil property measurements, among other issues. The commented sketch below illustrates some basic data handling steps in RStudio; it assumes that the tidyverse is loaded and that the data are stored in a dataframe named dat with columns as in Table 5.1 (hypothetical names).

# The lines below are a minimal sketch; `dat` and its column names are
# placeholders to be adapted to your own dataset

# remove NAs
dat <- na.omit(dat)

# remove duplicates
dat <- unique(dat)

# select and filter
dat <- dat %>%
  select(ProfID, HorID, Lat, Long, Top, Bottom, soc) %>%
  filter(Top >= 0)

# rename columns (new_name = old_name)
dat <- rename(dat, top = Top, bottom = Bottom)

# change class of columns / check for class
dat$ProfID <- as.character(dat$ProfID)
class(dat$ProfID)

Once the original dataset is clean and consistent, data harmonisation is needed to produce synthetic horizons (such as the 0–30 cm layer), as well as to make measurements from different laboratory methods compatible. Horizon harmonisation will be done with the mass-preserving spline function (Bishop, McBratney and Laslett, 1999; Malone et al., 2009) fitted to each individual soil profile, which requires more than one layer per profile; a quick check of this requirement is sketched below.
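As a minimal sketch (assuming a horizon table hor with a ProfID column, as used later in this chapter), profiles with a single layer can be dropped before fitting the splines:

library(dplyr)

# keep only profiles that have at least two layers
hor <- hor %>%
  group_by(ProfID) %>%
  filter(n() > 1) %>%
  ungroup()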

5.2 Study area and soil data

The study area is located in the southeast of the Pampas Region in Argentina, extending from the foothills of the Ventania and Tandilia hill systems to the southern coast of the Buenos Aires Province. To illustrate the different processes of this Technical Manual, we use three datasets from this region:

  • Georeferenced topsoil data
  • Non-georeferenced topsoil data
  • Soil profile data

5.2.1 Georeferenced topsoil data

These data were collected in 2011 by the National Institute of Agricultural Technology (INTA) and the Faculty of Agricultural Sciences of the National University of Mar del Plata (Unidad Integrada INTA-FCA) to map the status of soil nutrients in the Argentinian Pampas (Sainz Rozas et al., 2019). The dataset is a subset with 118 locations of topsoil samples (0-20 cm) that contains measurements of organic matter (om), P Bray (p_bray), pH (ph), and K (k) (Table 5.2) and is shown in the following map as points. This dataset is used in Chapter 7 for mapping K.

library(tidyverse)
library(sf)
library(mapview)
library(knitr)      # for kable()
library(kableExtra) # for kable_classic() and footnote()

mapviewOptions(fgb = FALSE)
data <- read_csv("Digital-Soil-Mapping/01-Data/data_with_coord.csv")
kable(data[1:10,], booktabs = T, col.names = gsub("[.]", " ", names(data)),
      caption = 'Dataset with coordinates for mapping nutrients.',
      format = 'html', digits = c(0,6,6,2,2,2,1)) %>%
  kable_classic(full_width = F) %>%
  # kable_styling(latex_options = 'striped', font_size = 10) %>%
  footnote(general = "Only the ten first rows are shown.")
Table 5.2: Dataset with coordinates for mapping nutrients.

LabID          x          y    om    ph  p_bray      k
   51  -61.51282  -37.37646  4.03  6.50   20.40  852.2
   60  -57.84725  -37.85136  5.53  6.05   10.52  769.6
   64  -58.87620  -38.54000  4.85  6.30   15.87  992.4
   67  -60.30394  -38.45300  3.13  6.30   20.85  740.2
   68  -60.39772  -38.51567  2.92  6.16   13.54  724.8
   69  -60.41442  -38.52914  2.15  6.65   46.17  699.0
   74  -60.00556  -38.76500  4.12  6.27   20.94  518.6
   75  -60.10750  -38.76472  4.24  5.84   26.82  450.2
   77  -60.17139  -38.79278  3.35  6.57   22.56  858.8
   78  -60.03111  -38.74611  3.53  6.49   20.09  662.9

Note: Only the ten first rows are shown.
s <- st_as_sf(data, coords = c("x", "y"), crs = 4326)
mapview(s, zcol = "p_bray", cex = 2.5, lwd = 0)

5.2.2 Non-georeferenced topsoil data

The second dataset consists of topsoil samples (0-20 cm) of P Bray (p_bray) without geographical coordinates, collected during 2005 by Sainz Rozas, Echeverria and Angelini (2011) to map the status of soil nutrients in the Argentinian Pampas. Each sample is linked to an administrative unit (Partido) of the Buenos Aires Province. This dataset is used to illustrate a case study with non-georeferenced data. The following map shows the districts with the sample size (n).

library(tidyverse)
library(sf)
library(mapview)
mapviewOptions(fgb = FALSE)
data <- read_csv("Digital-Soil-Mapping/01-Data/Buenos_Aires_sur_P_Bray.csv")
data
## # A tibble: 4,552 × 5
##    Partido Cod_Prov Cod_Part stratum p_bray
##    <chr>   <chr>    <chr>      <dbl>  <dbl>
##  1 Azul    06       049         6049    8  
##  2 Azul    06       049         6049   13.9
##  3 Azul    06       049         6049    5.7
##  4 Azul    06       049         6049    8.4
##  5 Azul    06       049         6049    8.8
##  6 Azul    06       049         6049    9.7
##  7 Azul    06       049         6049   11.8
##  8 Azul    06       049         6049    7.6
##  9 Azul    06       049         6049   10.6
## 10 Azul    06       049         6049    6.3
## # … with 4,542 more rows
## # ℹ Use `print(n = ...)` to see more rows
district <- st_read("Digital-Soil-Mapping/01-Data/district.shp", quiet = T)
mapview(district, zcol = "n", lwd = 0)
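
The district layer loaded above already contains the sample size per district (column n). If it did not, the counts could be derived from the point data and joined to the polygons; a minimal sketch, assuming the shapefile also carries a Partido column (hypothetical):

# count samples per administrative unit and attach the counts to the polygons
n_by_partido <- count(data, Partido, name = "n")
district <- left_join(district, n_by_partido, by = "Partido")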

5.2.3 Soil profile data

The third dataset belongs to the Soil Information System of Argentina (SISINTA; Olmedo, Rodriguez and Angelini, 2017), which contains soil profiles collected from the 1960s to recent years for soil survey purposes.

library(tidyverse)
library(sf)
library(mapview)
mapviewOptions(fgb = FALSE)
data <- read_csv("Digital-Soil-Mapping/01-Data/soil_profile_data.csv")
s <- data %>% filter(top==0)
s <- st_as_sf(s, coords = c("x", "y"), crs = 4326)
mapview(s, zcol = "k", cex = 2.5, lwd = 0)

5.3 Preparatory steps

Now, let’s open RStudio. Whenever starting to work on a project or task, it is necessary to set the working directory (WD). The WD is the folder path that R uses to read input and save output, for instance a plot or a table generated while working in R. Thus, the WD is central since it dictates where files and results can be found afterwards. As it is so important, there are multiple ways of setting the WD. One option is to click on the ‘Session’ menu > ‘Set Working Directory’ and select either ‘To Source File Location’ (the WD then corresponds to the folder where the script is saved) or ‘Choose Directory…’ and browse to the desired folder.

In this manual we propose an alternative that allows for more customisation and flexibility, since sometimes multiple WDs are needed, for instance to save the final map in a different folder than the covariates. We assign the file path of the WD to an R object: a character value (the file path) is written on the right side of the arrow (<-) and the object name (wd) on the left side (see code). Once this is done, the function setwd() sets the WD to the path stored in wd.

Before working on specific tasks in R, it is necessary to load the packages that contain the functions to be used. Before packages can be loaded, they need to be installed. There are many ways to do this, yet the most common is to call the function install.packages(). For instance, to install the terra package, one writes install.packages("terra"), which installs the package from CRAN. However, there are a few exceptions where development versions of R packages are required. In these instances, additional packages such as devtools or remotes are needed (see the example in the code below); these packages can install packages from, for instance, GitHub or Bitbucket repositories.

# 0 - User-defined variables ===================================================
wd <- 'C:/Users/hp/Documents/GitHub/Digital-Soil-Mapping'
#wd <- "C:/GIT/Digital-Soil-Mapping"

# 1 - Set working directory and load necessary packages ========================
setwd(wd) # change the path accordingly

library(tidyverse) # for data management and reshaping
library(readxl) # for importing excel files
library(mapview) # for seeing the profiles in a map
library(sf) # to manage spatial data (shp vectors) 

# install.packages("devtools") 
# devtools::install_bitbucket("brendo1001/ithir/pkg") #install ithir package
library(ithir) # for horizon harmonization

The next step is to load the national soil data into RStudio. For that, it is recommended to have the data either in Microsoft Excel format (.xlsx) or as a comma-separated values table (.csv). In both cases, each row represents a sample (or horizon) and each column represents a variable. The datasets can then be loaded from the specified folder using the respective functions in the code below. Note that in R, datasets also need to be assigned to a user-defined variable in order to be kept in the “global environment”.

After reading in the file, the package tidyverse comes into play. Using the select() and unique() functions, the user can select only the necessary columns from the table and ensure that no duplicates are included. At this point it may be necessary to rename certain columns, as shown for the profile and horizon ID columns in the code below. Finally, every time new datasets are loaded into R, it is recommended to check the data. Using the summary() function, users can see the class of each variable (= column) and descriptive statistics (for numerical variables). Common classes are character (chr) for text, integer (int) for whole numbers, and numeric (num) for real numbers.

# 2 - Import national data =====================================================
# Save your national soil dataset in the data folder /01-Data as a .csv file or 
# as a .xlsx file

## 2.1 - for .xlsx files -------------------------------------------------------
# Import horizon data 
# hor <- read_excel("01-Data/soil_data.xlsx", sheet = 2)
# # Import site-level data
# site <- read_excel("01-Data/soil_data.xlsx", sheet = 1)

## 2.2 - for .csv files --------------------------------------------------------
# Import horizon data 
hor <- read_csv(file = "Digital-Soil-Mapping/01-Data/soil_profile_data.csv")
site <- select(hor, id_prof, x, y, date) %>% unique()
hor <- select(hor, id_prof, id_hor, top:cec)

# change names of key columns
names(site)
## [1] "id_prof" "x"       "y"       "date"
names(site)[1] <- "ProfID"
names(hor)
##  [1] "id_prof" "id_hor"  "top"     "bottom"  "ph_h2o"  "k"       "soc"     "clay"    "bd"      "cec"
names(hor)[1] <- "ProfID"
names(hor)[2] <- "HorID"
# scan the data
summary(site)
##      ProfID           x                y               date           
##  Min.   :  51   Min.   :-61.64   Min.   :-38.81   Min.   :1969-05-29  
##  1st Qu.:6511   1st Qu.:-60.40   1st Qu.:-37.93   1st Qu.:1970-08-26  
##  Median :7092   Median :-59.28   Median :-37.54   Median :1971-12-10  
##  Mean   :6169   Mean   :-59.40   Mean   :-37.54   Mean   :1976-02-26  
##  3rd Qu.:7383   3rd Qu.:-58.40   3rd Qu.:-37.10   3rd Qu.:1981-06-07  
##  Max.   :8128   Max.   :-57.55   Max.   :-36.56   Max.   :2011-11-17
summary(hor)
##      ProfID         HorID            top             bottom           ph_h2o            k               soc              clay             bd      
##  Min.   :  51   Min.   :12230   Min.   :  0.00   Min.   :  5.00   Min.   : 5.00   Min.   : 0.200   Min.   : 0.020   Min.   : 2.20   Min.   :0.87  
##  1st Qu.:6512   1st Qu.:29161   1st Qu.: 15.00   1st Qu.: 28.00   1st Qu.: 6.70   1st Qu.: 1.400   1st Qu.: 0.250   1st Qu.:20.90   1st Qu.:1.16  
##  Median :6948   Median :31464   Median : 35.00   Median : 55.00   Median : 7.40   Median : 1.900   Median : 0.720   Median :27.20   Median :1.26  
##  Mean   :6166   Mean   :28491   Mean   : 42.67   Mean   : 62.14   Mean   : 7.59   Mean   : 1.994   Mean   : 1.457   Mean   :28.34   Mean   :1.26  
##  3rd Qu.:7385   3rd Qu.:33766   3rd Qu.: 65.00   3rd Qu.: 88.00   3rd Qu.: 8.50   3rd Qu.: 2.500   3rd Qu.: 2.360   3rd Qu.:34.25   3rd Qu.:1.38  
##  Max.   :8128   Max.   :37674   Max.   :190.00   Max.   :230.00   Max.   :10.30   Max.   :16.800   Max.   :19.000   Max.   :66.20   Max.   :1.49  
##                                                                   NA's   :318     NA's   :331      NA's   :358      NA's   :366     NA's   :1804  
##       cec       
##  Min.   : 2.40  
##  1st Qu.:19.70  
##  Median :24.90  
##  Mean   :25.38  
##  3rd Qu.:30.30  
##  Max.   :66.60  
##  NA's   :354

The selection of useful columns is very important since it ensures that users keep a good overview and a clean environment. Using the select() function, it is also possible to rename the variables right away (see code below).

# 3 - select useful columns ====================================================
## 3.1 - select columns --------------------------------------------------------
hor <- select(hor, ProfID, HorID, top, bottom, ph=ph_h2o, k, soc, clay, bd, cec)

# the variable ph_h2o was renamed as ph

5.4 Data quality check

Datasets need to be checked for their quality, as manually entered data in particular are prone to mistakes such as typos or duplicates. A thorough quality check ensures that:

  • all profiles have reasonable coordinates (within the area of interest);
  • there are no duplicated profiles; and
  • the depth logic within a profile is not violated.

To check the first point, the dataframe needs to be converted into a spatial object using the st_as_sf() function of the sf package. It is necessary to indicate the columns that contain latitude and longitude, as well as a coordinate reference system (CRS). We recommend WGS84, which corresponds to the EPSG code 4326; locally more appropriate CRSs can be found at https://epsg.io/. The mapview() command (from the mapview package) offers the possibility to visualise the profile locations in an interactive map. Finally, the filter() function can be used to remove rows that contain profiles with wrong locations; a sketch of these steps is given below.
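
A minimal sketch of these steps, using the site table created above (the bounding-box limits in the filter are hypothetical examples):

## 4.1 - Check the spatial consistency of the profiles -------------------------
library(sf)
library(mapview)
library(dplyr)

# convert the site table into a spatial object (WGS84, EPSG:4326)
site_sf <- st_as_sf(site, coords = c("x", "y"), crs = 4326)

# visualise the profile locations in an interactive map
mapview(site_sf, cex = 2.5, lwd = 0)

# remove profiles with locations outside the area of interest
# (replace the hypothetical bounding box with your own limits)
site <- filter(site, x > -62, x < -57, y > -39, y < -36)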

To visualise the profile locations, the soil data table was converted into a spatial object. Still, to check whether the database complies with the depth logic within each profile, it is necessary to convert the data table into a so-called soil profile collection, which allows for very specific operations. These operations are bundled in the package aqp (AQP = Algorithms for Quantitative Pedology) (Beaudette, Roudier and O’Geen, 2013). With the first lines of code below, the dataset is converted into a soil profile collection, and the profile and horizon tables are joined based on the site information. Now the profile collection can be visualised for any soil property; in this case, the first ten profiles are plotted coloured by cation exchange capacity (CEC). Using the checkHzDepthLogic() function, users can verify that profiles have no gaps or overlaps between neighbouring horizons. If there are any, the affected profiles can be inspected through their profile ID. In the following step, only profiles with valid horizon logic are kept. Finally, the soil profile collection is re-converted to a dataframe. With this, the quality check is finished.

## 4.2 - Convert data into a Soil Profile Collection ---------------------------
library(aqp)
depths(hor) <- ProfID ~ top + bottom
hor@site$ProfID <- as.numeric(hor@site$ProfID)
site(hor) <- left_join(site(hor), site)
## Joining, by = "ProfID"
profiles <- hor

profiles
## SoilProfileCollection with 357 profiles and 1813 horizons
## profile ID: ProfID  |  horizon ID: hzID 
## Depth range: 5 - 230 cm
## 
## ----- Horizons (6 / 1813 rows  |  10 / 11 columns) -----
## # A tibble: 6 × 10
##   ProfID hzID    top bottom HorID    ph     k   soc  clay    bd
##    <dbl> <chr> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    154 1         0     14 28425   6     2.2  3.62  22.9    NA
## 2    154 2        14     26 28426   6     1.9  2.84  22.9    NA
## 3    154 3        26     44 28427   6.4   2.5  1.06  36.9    NA
## 4    154 4        44     56 28428   6.7   2.2  0.46  24.6    NA
## 5    154 5        56    105 28429   6.5   1.8  0.16  18.6    NA
## 6    197 6         0     13 28588   5.9   2.8  3.35  27.3    NA
## [... more horizons ...]
## 
## ----- Sites (6 / 357 rows  |  4 / 4 columns) -----
## # A tibble: 6 × 4
##   ProfID     x     y date      
##    <dbl> <dbl> <dbl> <date>    
## 1    154 -58.7 -38.2 1982-10-07
## 2    197 -60.5 -38.4 1986-09-26
## 3    262 -58.9 -38.4 1981-11-05
## 4   2700 -58.5 -37.7 1970-02-01
## 5   2702 -58.0 -37.8 1970-03-21
## 6   2705 -57.9 -37.9 1970-03-24
## [... more sites ...]
## 
## Spatial Data:
## [EMPTY]
## 4.3 - plot the first 10 profiles using CEC as color -------------------------
plotSPC(x = profiles[1:10], name = "cec", color = "cec",
        name.style = "center-center")

## 4.4 - check data integrity --------------------------------------------------
# A valid profile is TRUE if all of the following criteria are false:
#    + depthLogic : boolean, errors related to depth logic
#    + sameDepth : boolean, errors related to same top/bottom depths
#    + missingDepth : boolean, NA in top / bottom depths
#    + overlapOrGap : boolean, gaps or overlap in adjacent horizons
profile_check <- aqp::checkHzDepthLogic(profiles)
head(profile_check)
##   ProfID valid depthLogic sameDepth missingDepth overlapOrGap
## 1    154  TRUE      FALSE     FALSE        FALSE        FALSE
## 2    197  TRUE      FALSE     FALSE        FALSE        FALSE
## 3    262  TRUE      FALSE     FALSE        FALSE        FALSE
## 4   2700  TRUE      FALSE     FALSE        FALSE        FALSE
## 5   2702  TRUE      FALSE     FALSE        FALSE        FALSE
## 6   2705  TRUE      FALSE     FALSE        FALSE        FALSE
# visualize some of these profiles by the pid
subset(profiles, grepl(6566, ProfID, ignore.case = TRUE))
## SoilProfileCollection with 1 profiles and 4 horizons
## profile ID: ProfID  |  horizon ID: hzID 
## Depth range: 80 - 80 cm
## 
## ----- Horizons (4 / 4 rows  |  10 / 11 columns) -----
## # A tibble: 4 × 10
##   ProfID hzID    top bottom HorID    ph     k   soc  clay    bd
##    <dbl> <chr> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   6566 643       0     16 29463   9.4   1.5  2.02   9.9    NA
## 2   6566 644      16     25 29464   9.5   1.2  0.91  10.7    NA
## 3   6566 645      26     55 29465   9.1   2.6  0.42  34.4    NA
## 4   6566 646      55     80 29466   9     2.2  0.19  24.6    NA
## 
## ----- Sites (1 / 1 rows  |  4 / 4 columns) -----
## # A tibble: 1 × 4
##   ProfID     x     y date      
##    <dbl> <dbl> <dbl> <date>    
## 1   6566 -58.2 -37.9 1970-04-07
## 
## Spatial Data:
##      [,1]
## [1,]   NA
## CRS:  NA
subset(profiles, grepl(6915, ProfID, ignore.case = TRUE))
## SoilProfileCollection with 1 profiles and 7 horizons
## profile ID: ProfID  |  horizon ID: hzID 
## Depth range: 140 - 140 cm
## 
## ----- Horizons (6 / 7 rows  |  10 / 11 columns) -----
## # A tibble: 6 × 10
##   ProfID hzID    top bottom HorID    ph     k   soc  clay    bd
##    <dbl> <chr> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   6915 864       0      6 31278   6     2.8 15.8   30.2    NA
## 2   6915 865       6     14 31279   6.7   1.8  5.65  33.8    NA
## 3   6915 866      14     21 31280   7.5   0.9  2.02  19.8    NA
## 4   6915 867      21     35 31281   7.6   1    0.45  13.8    NA
## 5   6915 868      35     65 31282   8.2   2.3  0.41  38.3    NA
## 6   6915 869      65     87 31283   8.5   2.7  0.23  40      NA
## [... more horizons ...]
## 
## ----- Sites (1 / 1 rows  |  4 / 4 columns) -----
## # A tibble: 1 × 4
##   ProfID     x     y date      
##    <dbl> <dbl> <dbl> <date>    
## 1   6915 -58.2 -37.6 1971-03-19
## 
## Spatial Data:
##      [,1]
## [1,]   NA
## CRS:  NA
subset(profiles, grepl(7726, ProfID, ignore.case = TRUE))
## SoilProfileCollection with 1 profiles and 5 horizons
## profile ID: ProfID  |  horizon ID: hzID 
## Depth range: 155 - 155 cm
## 
## ----- Horizons (5 / 5 rows  |  10 / 11 columns) -----
## # A tibble: 5 × 10
##   ProfID hzID    top bottom HorID    ph     k   soc  clay    bd
##    <dbl> <chr> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   7726 1581      0     27 35695   6.5   1.2 19     25.5    NA
## 2   7726 1582     27     38 35696   6.6   1.4  1.97  30      NA
## 3   7726 1583     38     70 35697   6.9   1.4  0.83  39.1    NA
## 4   7726 1584     70     97 35698   7.2   1.4  0.2   23.8    NA
## 5   7726 1585     97    155 35699   7.3   1.2 NA     14.4    NA
## 
## ----- Sites (1 / 1 rows  |  4 / 4 columns) -----
## # A tibble: 1 × 4
##   ProfID     x     y date      
##    <dbl> <dbl> <dbl> <date>    
## 1   7726 -58.8 -37.3 1971-02-01
## 
## Spatial Data:
##      [,1]
## [1,]   NA
## CRS:  NA
## 4.5 - keep only valid profiles ----------------------------------------------
clean_prof <- HzDepthLogicSubset(profiles)
## dropping profiles with invalid depth logic, see `metadata(x)$removed.profiles`
metadata(clean_prof)$removed.profiles
## [1] 6566 7410 8002
# write_rds(clean_prof, "01-Data/soilProfileCollection.rds")

## 4.6 convert soilProfileCollection to a table --------------------------------
dat <- left_join(clean_prof@site, clean_prof@horizons)
## Joining, by = "ProfID"
dat <- select(dat, ProfID, HorID, x, y, date, top, bottom, ph:cec )

The soil data table is now revised and of consistent quality. Thus, the available data can be used in the following steps, for instance to fill in missing soil properties. In the following example, pedotransfer functions (PTFs) based on the organic matter (OM) content are used to estimate bulk density (BD) for observations where it is missing.

5.5 Calculation of pedo-transfer functions

In the case of single-layer samples, which are common in sampling for nutrient determination, a locally calibrated pedotransfer function (PTF) should be applied. PTFs will also be required to harmonise laboratory methods. Experts from GLOSOLAN will provide advice in this regard.

Therefore, a customised function is introduced into our working environment. Users can write their own functions in R. This is often necessary when existing functions need to be customised or very specific calculations need to be performed. Functions greatly increase the efficiency of our code. For further information, it is recommended to consult online resources on the topic (e.g. https://hbctraining.github.io/Intro-to-R/lessons/03_introR-functions-and-arguments.html).

The function ‘estimateBD’ below implements several PTFs for estimating BD. Which equation is used is determined by the user, who has to choose one of the methods and specify the SOC value of the respective horizon. The SOC value is first converted to OM using the conversion factor 1.724 and then inserted into the respective PTF. The return() command tells R which value to output.

# 5 - Estimate BD using pedotransfer functions =================================

# create the function with all PTF
estimateBD <- function(SOC = NULL, method = NULL){
  # convert soil organic carbon (g/100 g) to organic matter using the
  # conventional van Bemmelen factor
  OM <- SOC * 1.724
  if(method == "Saini1996"){BD <- 1.62 - 0.06 * OM}
  if(method == "Drew1973"){BD <- 1 / (0.6268 + 0.0361 * OM)}
  if(method == "Jeffrey1979"){BD <- 1.482 - 0.6786 * (log(OM))}
  if(method == "Grigal1989"){BD <- 0.669 + 0.941 * exp(-0.06 * OM)}
  if(method == "Adams1973"){BD <- 100 / (OM / 0.244 + (100 - OM) / 2.65)}
  if(method == "Honeyset_Ratkowsky1989"){BD <- 1 / (0.564 + 0.0556 * OM)}
  # return the estimated bulk density (g/cm3)
  return(BD)
}

To apply the ‘estimateBD’ function, a test dataframe is first created that includes the SOC values from the cleaned profile table as well as the respective existing BD measurements. Rows with missing values in either column are excluded using the na.omit() function, since we first want to evaluate the difference between estimated and measured BDs. Then, the test dataframe is complemented with the BD estimates from each PTF. To add a new column to an existing dataframe, one writes on the left-hand side of the arrow the name of the existing dataframe object (in this case BD_test), the dollar sign ($), and the name of the new column. Here, the names are given according to the BD PTF used.

## 5.1 - Select a pedotransfer function ----------------------------------------
# create a test table of measured BD and SOC values to choose the best
# fitting pedotransfer function
BD_test <- tibble(SOC = clean_prof@horizons$soc,
                  BD_test = clean_prof@horizons$bd)
BD_test <-  na.omit(BD_test)

## 5.2 - Estimate BD for a subset using the pedotransfer functions -------------
BD_test$Saini <- estimateBD(BD_test$SOC, method="Saini1996")
BD_test$Drew <- estimateBD(BD_test$SOC, method="Drew1973")
BD_test$Jeffrey <- estimateBD(BD_test$SOC, method="Jeffrey1979")
BD_test$Grigal <- estimateBD(BD_test$SOC, method="Grigal1989")
BD_test$Adams <- estimateBD(BD_test$SOC, method="Adams1973")
BD_test$Honeyset_Ratkowsky <- estimateBD(BD_test$SOC,
                                         method="Honeyset_Ratkowsky1989")

The calculated BDs can now be compared using the summary() function. However, a faster and more accessible approach is to plot the different bulk density distributions for comparison. In case you are not familiar with the plot() function and its respective commands, it is recommended to check one of the many online learning resources such as https://intro2r.com/simple-base-r-plots.html. The plot shows both measured and estimated BD values as differently coloured lines (see the figure below).

## 5.3 Compare results ---------------------------------------------------------

# Observed values:
summary(BD_test$BD_test)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.87    1.16    1.26    1.26    1.38    1.49
# Predicted values:
summary(BD_test$Saini)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.289   1.408   1.505   1.473   1.542   1.574
summary(BD_test$Drew)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.211   1.326   1.437   1.406   1.485   1.529
summary(BD_test$Jeffrey)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3231  0.6253  1.0416  1.0176  1.3076  1.6695
summary(BD_test$Grigal)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.345   1.430   1.508   1.485   1.540   1.568
summary(BD_test$Adams)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.716   1.965   2.229   2.164   2.350   2.466
summary(BD_test$Honeyset_Ratkowsky)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.148   1.315   1.492   1.448   1.573   1.650
# Compare data distributions for observed and predicted BD
plot(density(BD_test$BD_test),type="l",col="black", ylim=c(0,5),
     lwd=2, main="Bulk Density Pedotransfer Functions")
lines(density(BD_test$Saini),col="green", lwd=2)
lines(density(BD_test$Drew),col="red", lwd=2)
lines(density(BD_test$Jeffrey),col="cyan", lwd=2)
lines(density(BD_test$Grigal),col="orange", lwd=2)
lines(density(BD_test$Adams),col="magenta", lwd=2)
lines(density(BD_test$Honeyset_Ratkowsky),col="blue", lwd=2)
legend("topleft",
       legend = c("Original", "Saini", "Drew", "Jeffrey", "Grigal", "Adams",
                  "Honeyset_Ratkowsky"),
       fill=c("black", "green", "red", "cyan", "orange","magenta", "blue"))

# Plot the Selected function again
plot(density(BD_test$BD_test),type="l",col="black", ylim=c(0,3.5),
     lwd=2, main="Bulk Density Selected Function")
lines(density(BD_test$Honeyset_Ratkowsky),col="blue", lwd=2)
legend("topleft",legend = c("Original", "Honeyset_Ratkowsky"),
       fill=c("black", "blue"))

The PTF chosen for estimating the BD of the missing horizons should be the one closest to the measured BD values. Once the appropriate PTF has been chosen, the ‘estimateBD’ function is applied to the dataframe ‘dat’ that was created at the end of the quality check. Here, new BD values are estimated for the rows in which the column ‘bd’ has missing values. Finally, a plot is generated to visualise the gap-filled bulk density values.

## 5.4 Estimate BD for the missing horizons ------------------------------------
dat$bd[is.na(dat$bd)] <-
  estimateBD(dat[is.na(dat$bd),]$soc, method="Honeyset_Ratkowsky1989")

# Explore the results
summary(dat$bd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.4192  1.2624  1.5797  1.4773  1.7008  1.7670     356
plot(density(BD_test$BD_test),type="l",col="black", ylim=c(0,3.5),
     lwd=2, main="Bulk Density Gap-Filling")
lines(density(dat$bd, na.rm = TRUE), col="green", lwd=2)
legend("topleft",legend = c("Original", "Original+Estimated"),
       fill=c("black", "green"))

5.6 Check for outliers

Unrealistically high or low values can have a considerable impact on the statistical analysis, and thus it is key to identify and carefully check those values in order to obtain valid results and eliminate potential bias. Again, the summary() function is apt to show general descriptive statistics such as maxima and minima. Based on this assessment, more detailed views of suspicious values can be obtained by filtering values above or below a certain threshold, as done in the code below for soil organic carbon (SOC) values above 10 percent. If such values do not belong to soil types that would justify exceptionally high SOC values, e.g. organic soils (Histosols), these rows can be removed based on the profile ID. The same process should be repeated for all soil properties. Such an evaluation can also be conducted visually for several properties at the same time using the tidyverse and ggplot2 packages, which allow plotting boxplots for several soil properties at once. To get more information on the tidyverse, please follow this link: https://r4ds.had.co.nz/. For a comprehensive overview of the functionalities of ggplot2, a more sophisticated way of plotting, this book provides a good overview: http://www.cookbook-r.com/Graphs/.

## 5.5 - Explore outliers ------------------------------------------------------
# Outliers should be carefully explored and compared with literature values.
# Only if it is clear that outliers represent impossible or highly unlikely 
# values, they should be removed as errors.
# 
# Carbon content higher than 10% is typical only for organic soils (Histosols)
# We will remove all atypically high SOC values as outliers
summary(dat$soc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.020   0.250   0.720   1.451   2.340  19.000     356
outliers <- na.omit(dat$ProfID[dat$soc > 10])
head(outliers)
## [1] 6915 7726
dat <- dat[dat$ProfID != 6915,]
dat <- dat[dat$ProfID != 7726,]

# Explore bulk density data, identify outliers
# remove layers with Bulk Density < 1 g/cm^3
low_bd_profiles <- na.omit(dat$ProfID[dat$bd<1])
dat <- dat[!(dat$ProfID %in% low_bd_profiles),]

# Explore data, identify outliers
x <- pivot_longer(dat, cols = ph:cec, values_to = "value",
                  names_to = "soil_property")
x <- na.omit(x)
ggplot(x, aes(x = soil_property, y = value, fill = soil_property)) +
  geom_boxplot() + 
  facet_wrap(~soil_property, scales = "free")

5.7 Harmonise soil layer depths

The last step towards a soil data table that can be used for mapping is to harmonise the soil depth layers to 0-30 cm (or 30-60 cm and 60-100 cm, respectively). This is necessary since we want to produce maps that cover exactly those depths and do not differ across soil profile locations. Thus, the relevant columns are selected from the dataframe, and the target soil properties as well as the upper and lower limits of the harmonised soil layer are specified (in the object depths).

In the following, a new dataframe ‘d’ is created in which the standard depth layers are stored. The code below shows a for loop that automatically calculates the values for the standard depths for each target soil property using the ea_spline() function of the ‘ithir’ package. The spline functions are explained in Section 5.1 (Format requirements of soil data).

# 6 - Harmonize soil layers ====================================================
## 6.1 - Set target soil properties and depths ---------------------------------
names(dat)
dat <- select(dat, ProfID, HorID, x, y, top, bottom, ph, k, soc, clay, bd, cec)

target <- c("ph", "k", "soc", "clay", "bd", "cec")
depths <- t(c(0,30))

## 6.2 - Create standard layers ------------------------------------------------
d <- unique(select(dat, ProfID, x, y))

for (i in seq_along(target)) {
  vlow <- min(dat[,target[i]][[1]], na.rm = TRUE)
  vhigh <- max(dat[,target[i]][[1]], na.rm = TRUE)
  o <- dat[,c("ProfID", "top", "bottom",target[i])] %>% 
    na.omit() %>%
    as.data.frame(stringsAsFactors = FALSE)
  x <- ithir::ea_spline(obj = o, var.name = target[i], d = depths, 
                        vlow = vlow[[1]], vhigh = vhigh[[1]])$harmonised 
  x[x==-9999] <- NA
  x <- x %>% 
    as_tibble() %>% 
    select(-`soil depth`)
  
  names(x) <- c("ProfID",paste0(target[i],c("_0_30","_30_60","_60_100")))
  d <- d %>% left_join(x, by = "ProfID" )
}
d

5.8 Save the results

Before finalising the soil data preparation, it is recommended to check again visually whether the calculations were conducted correctly. Again, the combination of tidyverse and ggplot2 functions provides high efficiency and versatility for visualising the desired soil properties. Finally, the write_csv() function is used to save the dataframe as a .csv file in the output folder (02-Outputs). With this, the soil data preparation is finalised.

# 7 - Plot and save results ====================================================

x <- pivot_longer(d, cols = ph_0_30:cec_0_30, values_to = "value",
                  names_sep = "_", 
                  names_to = c("soil_property", "top", "bottom"))
x <- mutate(x, depth = paste(top, "-" , bottom))
x <- na.omit(x)
ggplot(x, aes(x = depth, y = value, fill = soil_property)) +
  geom_boxplot() + 
  facet_wrap(~soil_property, scales = "free")

# remove BD and CF
# d <- select(d, ProfID:y, soc_0_30:ocs_60_100)
# save data
write_csv(d, "02-Outputs/spline_soil_profile.csv")

References

Beaudette, D.E., Roudier, P. & O’Geen, A. 2013. Algorithms for quantitative pedology: A toolkit for soil scientists. Computers & Geosciences, 52: 258–268.
Bishop, T.F.A., McBratney, A.B. & Laslett, G.M. 1999. Modelling soil attribute depth functions with equal-area quadratic smoothing splines. Geoderma, 91: 27–45. https://doi.org/10.1016/s0016-7061(99)00003-8
Malone, B.P., McBratney, A.B., Minasny, B. & Laslett, G.M. 2009. Mapping continuous depth functions of soil carbon storage and available water capacity. Geoderma, 154: 138–152. https://doi.org/10.1016/j.geoderma.2009.10.007
Olmedo, G., Rodriguez, D. & Angelini, M. 2017. Advances in digital soil mapping and soil information systems in Argentina. GlobalSoilMap, pp. 13–16. CRC Press.
Sainz Rozas, H.R., Echeverria, H.E. & Angelini, H.P. 2011. Niveles de carbono orgánico y pH en suelos agrı́colas de las regiones pampeana y extrapampeana argentina. Ciencia del suelo, 29(1): 29–37.
Sainz Rozas, H.R., Eyherabide, M., Larrea, G.E., Martinez Cuesta, N., Angelini, H.P., Reussi Calvo, N.I. & Wyngaard, N. 2019. Relevamiento y determinación de propiedades quı́micas en suelos de aptitud agrı́cola de la región pampeana. Fertilizar, Argentina, 8(9): 12.